ABSTRACT We consider the problem of classifying an unknown probability distribution based on a
sequence of random samples drawn according to this distribution. Specifically, if $A$ is a subset of the space
of all probability measures $M_1(\Sigma)$ over some compact Polish space $\Sigma$, we want to decide whether
the unknown distribution belongs to $A$ or to its complement. We propose an algorithm which leads a.s. to
1  Introduction


In  this  paper,  we  consider  the  problem  of  classifying  an  unknown  probability  distribution  into  one  of  a 
finite  or  countable  number  of  classes  based  on  random  samples  drawn  from  the  unknown  distribution.  This 
problem  arises  in  a  number  of  applications  involving  classification  and  statistical  inference.  For  example, 
consider  the  following  problems: 

1. Given i.i.d. observations $x_1, x_2, \ldots$ from some unknown distribution $P$, we wish to decide whether
the mean of $P$ is in some particular set (e.g., whether it lies in some interval, or whether it is rational).

2. Given i.i.d. observations $x_1, x_2, \ldots$, we wish to decide whether or not the unknown distribution
belongs to a particular parametric class (e.g., to determine if it is Gaussian) or to determine to
which of a countable hierarchy of classes the unknown distribution belongs (e.g., to determine class
membership based on some smoothness parameter of the density function).

3. We wish to decide whether or not observations $x_1, x_2, \ldots$ are coming from a Markov source, and if
so to determine the order of the Markov source.

In these examples, our goal is to decide whether an unknown distribution $\mu$ belongs to a set of
distributions $A$ or its complement $A^c$, or more generally to decide to which of a countable collection of sets of
distributions $A_1, A_2, \ldots$ the unknown $\mu$ belongs. After each new observation $x_n$ we will make a decision as
to the class membership of the unknown distribution. Our criterion for success is to require that almost
surely only a finite number of mistakes are made. There are two aspects to the "almost sure" criterion.
First, as expected, we require that with probability one (with respect to the observations $x_1, x_2, \ldots$) our
decision will be correct from some point on. However, depending on the structure of the $A_i$, classification
may be difficult for certain distributions $\mu$. Hence, given a measure on the set of distributions, we allow
failure (i.e., do not require a finite number of mistakes) on a set of distributions of measure zero.

Our work is motivated by the previous work of Cover (1973), Koplowitz (1977), and Kulkarni and
Zeitouni (1991). These works deal with the specific case in which the
unknown distribution is to be classified according to its mean based on i.i.d. observations, as in
example problem 1 above. In this case, a subset of $\mathbb{R}$ can be identified with the set of distributions $A$
in the natural way (i.e., all distributions whose mean is in a specified set). Cover (1973) considered the
case of distributions on $[0,1]$ with $A = \mathbb{Q}_{[0,1]}$, the set of rationals in $[0,1]$, and more generally the case of
countable $A$. He provided a test which, for any measure with mean in $A$ or with mean in $A^c \setminus N$, will make
(almost surely) only a finite number of mistakes, where $N$ is a set of Lebesgue measure 0. For countable
$A$, Cover also considered the countable hypothesis testing problem of deciding exactly the true mean in
the case the true mean belongs to $A$, and provided a decision rule satisfying a similar success criterion.
Koplowitz (1977) showed some properties of sets $A$ which allow for such decision rules and gave some
characterizations of the set $N$. For example, he showed that if $\bar{A}$ (the closure of $A$) is countable then $N$ is
empty, while if $\bar{A}$ is uncountable then $N$ is uncountable. Kulkarni and Zeitouni (1991) extended the results
of Cover (1973) by allowing the set $A$ to be uncountable, not necessarily of measure 0, but such that it
satisfies a certain structural assumption. Roughly speaking, this structural assumption requires that $A$ be
decomposable into a countable union of increasing sets $B_m$ such that a small dilation of $B_m$ increases the
Lebesgue measure by only a sufficiently small amount. In a different direction, Dembo and Peres (1991)
provide necessary and sufficient conditions for the almost sure discernibility between sets. Their results,
when specialized to the set-up discussed above, show that the inclusion of the possibility of some errors on
the set of irrationals is necessary in order to ensure discernibility.

The decision rules of [4, 10, 11, 5] are basically as follows. At time $n$, the smallest $m$ is selected
such that the observations are sufficiently well explained by a hypothesis in $B_m$. If $m$ is not too large, we
decide that the unknown distribution belongs to $A$; otherwise we decide $A^c$. For the case of countable
hypothesis testing, a similar criterion is used. Thus, the $B_m$ can be thought of as a decomposition of $A$
into hypotheses of increasing complexity, and so the decision rules are reminiscent of Occam's razor or the
MDL (Minimum Description Length) principle.

The problem considered in this paper uses a success criterion and decision rules very similar to those
in the previous work of [4, 10, 11], but allows much more general types of classification of the unknown
distribution. Section 2 treats the case of classification in $A$ versus $A^c$ for distributions on an arbitrary
compact complete separable metric space (i.e., a compact Polish space) with i.i.d. observations. The case
of classification among a countable number of sets $A_1, A_2, \ldots$ from i.i.d. observations is considered in
Section 3. Thus, the results of these two sections cover the example problems 1 and 2 mentioned above.
Furthermore, we also consider relaxations of the basic assumption concerning the i.i.d. structure of the
observations $x_1, \ldots, x_n$. Namely, results for observations with Markov dependence are presented in Section
4. In particular, we treat example problem 3 on the determination of the order of a Markov chain.

We now give a precise formulation of the problems considered here. Let $x_1, \ldots, x_n$ be i.i.d. samples
drawn from some distribution $\mu$ (as mentioned, Markov dependence will be considered in Section 4). We
assume that $x_i$ takes values in some compact Polish space $\Sigma$, which for concreteness should be thought
of as $[0,1]^d \subset \mathbb{R}^d$. Let $M_1(\Sigma)$ denote the space of probability measures on $\Sigma$. We put on $M_1(\Sigma)$ the
Prohorov metric, denoted $d(\cdot,\cdot)$, whose topology is equivalent to the weak topology.
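
For reference, we recall the standard definition of the Prohorov metric, since it is the form used later when $\epsilon$-covers of the space of probability measures are constructed. The symbol $\rho$ for the metric on the underlying space is our own notation and does not appear elsewhere in the text.

```latex
d(\mu,\nu) \;=\; \inf\Bigl\{\epsilon > 0 \;:\;
   \mu(C) \le \nu(C^{\epsilon}) + \epsilon
   \ \text{ and }\
   \nu(C) \le \mu(C^{\epsilon}) + \epsilon
   \ \text{ for all Borel } C \subset \Sigma \Bigr\},
\qquad
C^{\epsilon} \;=\; \{x \in \Sigma \;:\; \rho(x, C) < \epsilon\}.
```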

We  consider  here  the  following  problems: 
P-1) Based on the sequence of observations $(x_1, \ldots, x_n)$, decide whether $\mu \in A$ or $\mu \in A^c$, where $A$ is
some given set satisfying certain structural properties (cf. A-1 below).

P-2) Based on the sequence of observations $(x_1, \ldots, x_n)$, decide whether $\mu \in A_i$, where all $A_i \subset M_1(\Sigma)$,
$i = 1, 2, \ldots$, are sets satisfying structural properties (cf. A-1 below).

Since $M_1(\Sigma)$ is a Polish space, there exist on $M_1(\Sigma)$ many finite measures, which we may assume to be
normalized to have total mass 1. Suppose one is given a particular measure, denoted $G$, on $M_1(\Sigma)$. In
particular, we allow $G$ to charge all open sets in $M_1(\Sigma)$. $G$ will play the role of the Lebesgue measure in
the following structural condition, which is reminiscent of the assumption in Kulkarni and Zeitouni (1991):

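A-1) There exists a sequence of open sets $C_m \subset M_1(\Sigma)$ and closed sets $B_m \subset M_1(\Sigma)$, and a sequence of
positive constants $\epsilon(m)$ such that:

1) $\forall \mu \in A$ $\exists m_0(\mu) < \infty$ s.t. $\forall m > m_0(\mu)$, $\mu \in B_m$.

2) $d(B_m, C_m^c) = \sqrt{2\epsilon(m)} > 0$.

3) $G\left(\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \left(C_m^{\sqrt{2\epsilon(m)}} \setminus A\right)\right) = 0$, where $C_m^{\sqrt{2\epsilon(m)}} = \{\nu \in M_1(\Sigma) \mid d(\nu, C_m) < \sqrt{2\epsilon(m)}\}$ is the
$\sqrt{2\epsilon(m)}$ dilation of $C_m$.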

A-1) is an embellishment of the structural assumption in Kulkarni and Zeitouni (1991), which corresponds
to the case where $B_m$ is a monotone sequence and the $C_m$ are taken as the $\sqrt{2\epsilon(m)}$ dilation of $B_m$.
The use of A-1) 1) and A-1) 2) was proposed to us by A. Dembo and Y. Peres, who also obtained various
conditions for full discernibility between hypotheses, cf. Dembo and Peres (1991). We note that, as in
Kulkarni and Zeitouni (1991), the assumption is immediately satisfied for countable sets $A$ by taking as
$B_m$ the union of the first $m$ components of $A$ and noting that, for a finite measure on a metric space,
$G(B(x,\delta) \setminus \{x\}) \to 0$ as $\delta \to 0$, where $B(x,\delta)$ denotes the open ball of radius $\delta$ around $x$. More generally, A-1) is
satisfied for any closed set by taking $B_m = A$ and using for $C_m$ a sequence of open sets which include $A$
and whose measure converges to the outer measure of $A$. Since $C_m$ is open and $\Sigma$ is compact, it follows that
$d(A, C_m^c) > 0$, and A-1) is satisfied. By the same considerations, it also follows that A-1) is satisfied for
any countable union of closed sets. Also, note that whenever both $A = \bigcup_{i=1}^{\infty} A_i$ and $A^c = \bigcup_{i=1}^{\infty} D_i$ with
$A_i, D_i$ closed then, choosing $B_m = \bigcup_{i=1}^{m} A_i$ and $C_m = \left(\bigcup_{i=1}^{m} D_i\right)^c$, one sees that A-1) holds (with actually an
empty intersection in A-1) 3) for appropriate $\epsilon(m)$). In this situation, the results of this paper correspond
to the sufficient part of Dembo and Peres (1991).
2  Classification in $A$ versus $A^c$


The definition of success of the decision rule will be similar to the one used in Kulkarni and Zeitouni (1991).
Namely, a test which makes at each instant $n$ a decision whether $\mu \in A$ or $\mu \in A^c$ based on $x_1, \ldots, x_n$ will
be called successful if:

(S.1) $\forall \mu \in A$, a.s. $\omega$, $\exists\, T(\omega)$ s.t. $\forall n > T(\omega)$, the decision is '$A$'.

(S.2) $\exists\, N \subset M_1(\Sigma)$ s.t.

(S.2.1) $G(N) = 0$

(S.2.2) $\forall \mu \in A^c \setminus N$, a.s. $\omega$, $\exists\, T(\omega)$ s.t. $\forall n > T(\omega)$, the decision is '$A^c$'.

Note that the outcome is unspecified on $N$. Note also that the definition is asymmetric in the roles
played by $A$ and $A^c$, in the sense that errors in $A$ are not allowed at all.

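Let $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$. We recall that $\mu_n$ satisfies a large deviation principle, i.e.

$$-\inf_{\theta \in A^o} H(\theta|\mu) \le \liminf_{n\to\infty} \frac{1}{n} \log P(\mu_n \in A) \le \limsup_{n\to\infty} \frac{1}{n} \log P(\mu_n \in A) \le -\inf_{\theta \in \bar{A}} H(\theta|\mu)$$  (2.1)

where $\bar{A}$ ($A^o$) denotes the closure (interior) of a set $A \subset M_1(\Sigma)$ in the weak topology, and

$$H(\theta|\mu) = \begin{cases} \int_\Sigma d\theta \, \log \frac{d\theta}{d\mu} & \text{if } \theta \ll \mu \\ \infty & \text{otherwise} \end{cases}$$  (2.2)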
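
To make the rate function concrete, the following is a minimal sketch of the relative entropy for the special case of a finite space, where measures are probability vectors. The function name `relative_entropy` and the finite-alphabet restriction are our own illustrative assumptions; the paper works on a general compact Polish space.

```python
import math
from typing import Sequence


def relative_entropy(theta: Sequence[float], mu: Sequence[float]) -> float:
    """Relative entropy H(theta | mu) of two probability vectors on the same finite set."""
    if len(theta) != len(mu):
        raise ValueError("theta and mu must have the same length")
    total = 0.0
    for t, m in zip(theta, mu):
        if t == 0.0:
            continue          # the term 0 * log(0 / m) is taken to be 0
        if m == 0.0:
            return math.inf   # theta is not absolutely continuous w.r.t. mu
        total += t * math.log(t / m)
    return total
```

For instance, `relative_entropy([1.0, 0.0], [0.5, 0.5])` evaluates to $\log 2$, while swapping the arguments gives infinity because absolute continuity fails.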

Our decision rule is very similar to that in Kulkarni and Zeitouni (1991). Specifically, we parse the
input sequence $x_1, x_2, \ldots$ to form the subsequences

$$X^m = (x_{\beta(m-1)+1}, \ldots, x_{\beta(m)})$$  (2.3)

where the choice of the $\beta(m)$ will be given below. The length of the sequence $X^m$ will be denoted by $\alpha(m)$,
so that

$$\beta(m) = \sum_{i=1}^{m} \alpha(i), \qquad \beta(0) = 0.$$  (2.4)

We will specify the $\beta(m)$ by appropriately selecting the lengths $\alpha(m)$ of the subsequences.
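
The parsing can be sketched in a few lines. The helper below, whose name `block_bounds` is our own, returns the first and last sample index of each block for a caller-supplied length function; the lengths used in the usage note are arbitrary and are not the ones prescribed later in this section.

```python
from typing import Callable, List, Tuple


def block_bounds(alpha: Callable[[int], int], num_blocks: int) -> List[Tuple[int, int]]:
    """Return [(beta(m-1) + 1, beta(m)) for m = 1..num_blocks], 1-indexed and inclusive."""
    bounds = []
    beta = 0                      # beta(0) = 0
    for m in range(1, num_blocks + 1):
        start = beta + 1
        beta += alpha(m)          # beta(m) = beta(m-1) + alpha(m)
        bounds.append((start, beta))
    return bounds
```

With the illustrative choice `alpha = lambda m: 2 * m`, the first three blocks cover samples 1 to 2, 3 to 6, and 7 to 12.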

At the end of each subsequence $X^m$, we form the empirical measure $\mu_{X^m}$ based on the data in the
subsequence $X^m$. Namely,

$$\mu_{X^m} = \frac{1}{\alpha(m)} \sum_{i=\beta(m-1)+1}^{\beta(m)} \delta_{x_i}$$  (2.5)

Then we make a decision of whether $\mu \in A$ or $\mu \in A^c$ according to whether $\mu_{X^m} \in C_m$ or not. Between
parsings, we do not change the decision.

Recall that from the structural assumption A-1), $C_m^c$ is $\sqrt{2\epsilon(m)}$ separated from $B_m$. Our idea is to
choose $\alpha(m)$ sufficiently large such that if the true measure $\mu$ is in $B_m$, then we will have enough data in
forming the empirical measure $\mu_{X^m}$ to make the probability of an incorrect decision (deciding $A^c$ because
$\mu_{X^m} \in C_m^c$) less than $1/m^2$. If $\alpha(m)$ can be chosen in this manner, then for any $\mu \in A$, once $m > m_0(\mu)$
our probability of error at the end of parsing interval $m$ is less than $1/m^2$, so that by the Borel-Cantelli lemma
we make only finitely many errors.

To show that $\alpha(m)$ can be chosen to satisfy the necessary properties, we will need a strengthened
version of the upper bound in Sanov's theorem (2.1). To do that, we use the notion of covering number:

Definition Let $\epsilon > 0$ be given. The covering number of $M_1(\Sigma)$, denoted $N(\epsilon, M_1(\Sigma))$, is defined by

$$N(\epsilon, M_1(\Sigma)) = \inf \{ n \mid \exists y_1, \ldots, y_n \in M_1(\Sigma) \text{ s.t. } M_1(\Sigma) \subset \bigcup_{i=1}^{n} B(y_i, \epsilon) \}$$  (2.6)

where $B(y, \epsilon)$ denotes a ball of radius $\epsilon$ (in the Prohorov metric) around $y$.

Similarly, for any given $\epsilon$, denote by $N^\Sigma(\epsilon)$ the covering number of $\Sigma$, i.e.

$$N^\Sigma(\epsilon) = \inf \{ n \mid \exists y_1, \ldots, y_n \in \Sigma \text{ s.t. } \Sigma \subset \bigcup_{i=1}^{n} B(y_i, \epsilon) \}$$  (2.7)

where the $B(y_i, \epsilon)$ are taken in the metric corresponding to $\Sigma$.
We claim now:

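Lemma 1

$$N(\epsilon, M_1(\Sigma)) \le 2 \left(\frac{e}{\epsilon}\right)^{N^\Sigma(\epsilon)} = \bar{N}(\epsilon, M_1(\Sigma))$$  (2.8)

Proof In order to prove the lemma, we will explicitly construct an $\epsilon$-cover of $M_1(\Sigma)$ with at most $\bar{N}(\epsilon, M_1(\Sigma))$
elements.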


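Let $y_1, \ldots, y_{N^\Sigma(\epsilon)}$ be the centers of a set of $\epsilon$-balls in $\Sigma$ which create the cover $N^\Sigma(\epsilon)$ in (2.7). Let
$\delta_i = \delta_{y_i}$, i.e. the distribution concentrated at $y_i$, and let

$$\mu_i^j = j \, \delta_i, \qquad j \in \left\{0, \frac{\epsilon}{N^\Sigma(\epsilon)}, \frac{2\epsilon}{N^\Sigma(\epsilon)}, \ldots, 1\right\}.$$

Define $Y = \{ y \in M_1(\Sigma) : \exists\, (i_1, j_1), \ldots, (i_k, j_k) \text{ s.t. } y = \sum_{\alpha=1}^{k} \mu_{i_\alpha}^{j_\alpha} \}$. Note that $Y$ is a finite set, for
it includes at most $\left(\frac{N^\Sigma(\epsilon)}{\epsilon} + 1\right)^{N^\Sigma(\epsilon)}$ members. Also, note that $Y$ is an $\epsilon$-cover of $M_1(\Sigma)$, i.e. for any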
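$\mu \in M_1(\Sigma)$ there exists a $y \in Y$ such that for any open set $C \subset \Sigma$, $\mu(C) \le y(C^\epsilon) + \epsilon$. To see that, choose
as $y$ the following approximation to $\mu$:

Let $i_\alpha = \alpha$, $\alpha = 1, \ldots, N^\Sigma(\epsilon)$, and choose $j_\alpha = \left\lfloor \mu\left( B(y_\alpha, \epsilon) \setminus \bigcup_{k=1}^{\alpha-1} B(y_k, \epsilon) \right) \right\rfloor$, where by $\lfloor x \rfloor$ we
mean the closest approximation to $x$ on the $\frac{\epsilon}{N^\Sigma(\epsilon)}$-net from below. Finally, let $j_{N^\Sigma(\epsilon)} = 1 - \sum_{\alpha=1}^{N^\Sigma(\epsilon)-1} j_\alpha$.

Take now $y = \sum_{\alpha=1}^{N^\Sigma(\epsilon)} j_\alpha \delta_\alpha$. It follows that $y$ is a probability measure based on a finite number of atoms and,
furthermore, $d(y, \mu) \le \epsilon$. We need therefore only to estimate the cardinality of the set $Y$, denoted $|Y|$. Note
that $|Y|$ is just the number of vectors $(j_1, \ldots, j_{N^\Sigma(\epsilon)})$ such that $\sum_{i=1}^{N^\Sigma(\epsilon)} j_i = 1$ and $j_i \in \{0, \frac{\epsilon}{N^\Sigma(\epsilon)}, \ldots, 1\}$.

It follows that

$$|Y| \le \frac{1}{N^\Sigma(\epsilon)!} \left(\frac{N^\Sigma(\epsilon)}{\epsilon} + 1\right)^{N^\Sigma(\epsilon)}$$  (2.9)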


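However, by Stirling's formula

$$\log \left(N^\Sigma(\epsilon)!\right) \ge N^\Sigma(\epsilon) \log N^\Sigma(\epsilon) - N^\Sigma(\epsilon)$$  (2.10)

Substituting (2.10) into (2.9), one has

$$|Y| \le \left(\frac{N^\Sigma(\epsilon)}{\epsilon} + 1\right)^{N^\Sigma(\epsilon)} \frac{e^{N^\Sigma(\epsilon)}}{\left(N^\Sigma(\epsilon)\right)^{N^\Sigma(\epsilon)}}$$

which implies that

$$N(\epsilon, M_1(\Sigma)) \le \left(\frac{e}{\epsilon}\right)^{N^\Sigma(\epsilon)} \left(1 + \frac{\epsilon}{N^\Sigma(\epsilon)}\right)^{N^\Sigma(\epsilon)} \le 2 \left(\frac{e}{\epsilon}\right)^{N^\Sigma(\epsilon)} = \bar{N}(\epsilon, M_1(\Sigma))$$  (2.11)

where the last inequality uses $\left(1 + \epsilon/N^\Sigma(\epsilon)\right)^{N^\Sigma(\epsilon)} \le e^\epsilon \le 2$ for $\epsilon \le \log 2$.

□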
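
The rounding step used in the proof of the covering bound can be sketched as follows. We assume the masses that $\mu$ assigns to the disjointified covering balls are already available as exact fractions; the name `quantize_weights` and the use of `Fraction` are our own choices for illustration.

```python
from fractions import Fraction
from math import floor
from typing import List, Sequence


def quantize_weights(p: Sequence[Fraction], eps: Fraction) -> List[Fraction]:
    """Round cell masses p_1..p_N down to the (eps / N)-net.

    The last atom absorbs the remainder so that the weights still sum to 1.
    """
    n = len(p)
    if n == 0 or sum(p) != 1:
        raise ValueError("p must be a non-empty probability vector")
    step = eps / n
    # First N-1 weights: closest point of the net from below.
    weights = [floor(w / step) * step for w in p[:-1]]
    # Last weight: whatever mass is left over.
    weights.append(Fraction(1) - sum(weights, Fraction(0)))
    return weights
```

Each of the first $N-1$ weights changes by less than $\epsilon/N$, so the total mass moved onto the last atom is less than $\epsilon$.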


For completeness, we show in the Appendix a complementary lower bound on the covering number
which exhibits a behavior similar to $\bar{N}$. Thus, the upper bound $\bar{N}$ cannot be much improved.

The existence of the bound $\bar{N}$ permits us to mimic the computation in Kulkarni and Zeitouni (1991)
for the case at hand. Indeed, a crucial step needed is bounding the probability of complements of balls, for
all $n$, uniformly over all measures, as follows:

Theorem 1

$$P(\mu_n \in B(\mu, \delta)^c) \le \bar{N}\left(\frac{\delta}{4}, M_1(\Sigma)\right) e^{-n (\delta/4)^2}$$
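Proof The proof follows the standard Chebycheff bound technique, without taking $n$ limits as in the
large deviation framework. Indeed,

$$P(\mu_n \in B(\mu, \delta)^c) \le \bar{N}\left(\frac{\delta}{4}, M_1(\Sigma)\right) \cdot \sup_{y \in M_1(\Sigma),\, d(y,\mu) \ge 3\delta/4} P\left(\mu_n \in B\left(y, \frac{\delta}{4}\right)\right)$$

Therefore, by the Chebycheff bound, denoting by $P_n$ the law of the random variable $\mu_n$ and by $C_b(\Sigma)$ the
space of continuous functions on $\Sigma$, it follows that for any $\theta \in C_b(\Sigma)$,

$$
\begin{aligned}
P(\mu_n \in B(\mu, \delta)^c)
&\le \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) \sup_{y:\, d(y,\mu) \ge 3\delta/4} \int 1_{B(y, \delta/4)}(\nu) \, e^{n \langle \theta, \nu \rangle} e^{-n \langle \theta, \nu \rangle} \, dP_n(\nu) \\
&\le \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) \sup_{y:\, d(y,\mu) \ge 3\delta/4} \exp\left[ -n \sup_{\theta \in C_b(\Sigma)} \inf_{\nu \in B(y, \delta/4)} \left( \langle \theta, \nu \rangle - \tfrac{1}{n} \log E_{P_n}\left(e^{n \langle \theta, \nu \rangle}\right) \right) \right] \\
&= \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) \exp\left( -n \inf_{\nu \in B(y, \delta/4),\, d(y,\mu) \ge 3\delta/4} \; \sup_{\theta \in C_b(\Sigma)} \left( \langle \theta, \nu \rangle - \tfrac{1}{n} \log E_{P_n}\left(e^{n \langle \theta, \nu \rangle}\right) \right) \right) \\
&= \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) \exp\left( -n \inf_{\nu \in B(y, \delta/4),\, d(y,\mu) \ge 3\delta/4} H(\nu|\mu) \right) \\
&\le \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) \exp\left( -n \inf_{\nu \in B(\mu, \delta/2)^c} H(\nu|\mu) \right) \\
&\le \bar{N}\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) e^{-n (\delta/4)^2}
\end{aligned}
$$  (2.12)

where $\langle \theta, \nu \rangle = \int \theta(x) \nu(dx)$, the first equality in (2.12) follows from the min-max theorem for convex
compact sets (cf. Theorem 4.2 of Sion (1958)), the second equality follows by Lemma 3.2.13 of Deuschel
and Stroock (1989), and the last inequality from the fact that (Deuschel and Stroock (1989), Exercise
3.2.24) for any $\theta \in B(\mu, \delta/2)^c$,

$$\frac{\delta}{2} \le d(\theta, \mu) \le \|\theta - \mu\|_{var} \le 2 H^{1/2}(\theta|\mu)$$

□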


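Corollary 1 Let $B_m \subset M_1(\Sigma)$ be a measurable set such that $\mu \in B_m$. Let $B_m^\delta$ denote an open set such
that $d(B_m, (B_m^\delta)^c) \ge \delta$. Then

$$P\left(\mu_n \in (B_m^\delta)^c\right) \le \bar{N}\left(\frac{\delta}{4}, M_1(\Sigma)\right) e^{-n (\delta/4)^2}$$  (2.13)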
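
The step from the ball estimate to the corollary is short and can be spelled out as below. This is our own one-line justification, not text from the source, and it assumes that the uniform bound on complements of balls has the form $\bar{N}(\delta/4, M_1(\Sigma))\, e^{-n(\delta/4)^2}$.

```latex
\nu \in (B_m^{\delta})^c
  \;\Longrightarrow\;
  d(\nu,\mu) \,\ge\, d\bigl(B_m, (B_m^{\delta})^c\bigr) \,\ge\, \delta
  \quad (\text{since } \mu \in B_m)
  \;\Longrightarrow\;
  \nu \in B(\mu,\delta)^c ,
\qquad\text{hence}\qquad
P\bigl(\mu_n \in (B_m^{\delta})^c\bigr)
  \;\le\; P\bigl(\mu_n \in B(\mu,\delta)^c\bigr)
  \;\le\; \bar N\!\left(\tfrac{\delta}{4}, M_1(\Sigma)\right) e^{-n(\delta/4)^2}.
```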


We  return  now  to  the  proposed  classification  algorithm.  Motivated  by  Corollary  1,  define 
$$\alpha(m) = \frac{8}{\epsilon(m)} \left[ 2 \log m + \log 2 + N^\Sigma\left(\sqrt{\epsilon(m)/8}\right) \left(1 - \log \sqrt{\epsilon(m)/8}\right) \right]$$  (2.14)

and let $\beta(m)$ be as defined previously by (2.4).

Note that with this choice of $\alpha(m)$, using Corollary 1 with $\delta = \sqrt{2\epsilon(m)}$ (so that $\delta/4 = \sqrt{\epsilon(m)/8}$), and the expression for
$\bar{N}(\delta/4, M_1(\Sigma))$ from (2.8), we have that for all $\mu \in A$ and $m > m_0(\mu)$,

$$P(\mu_{X^m} \in C_m^c) \le \frac{1}{m^2}$$  (2.15)

as we wanted.
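
As a numerical sanity check, the sketch below computes a block length that forces the bound $\bar{N}(\delta/4, M_1(\Sigma))\, e^{-\alpha (\delta/4)^2}$ with $\delta = \sqrt{2\epsilon(m)}$ below $1/m^2$, taking $\log \bar{N}(\epsilon, M_1(\Sigma)) = \log 2 + N^\Sigma(\epsilon)(1 - \log \epsilon)$. These two formulas are assumed here as inputs. The helper names, the rounding up to an integer, and the crude covering number for the unit interval are our own assumptions.

```python
import math
from typing import Callable


def log_cover_bound(eps: float, n_sigma: Callable[[float], int]) -> float:
    """log of 2 * (e / eps) ** N_Sigma(eps)."""
    return math.log(2.0) + n_sigma(eps) * (1.0 - math.log(eps))


def block_length(m: int, eps_m: float, n_sigma: Callable[[float], int]) -> int:
    """Smallest integer alpha with log_cover_bound(r) - alpha * eps_m / 8 <= -2 log m."""
    r = math.sqrt(eps_m / 8.0)   # delta / 4 with delta = sqrt(2 * eps_m)
    need = 2.0 * math.log(m) + log_cover_bound(r, n_sigma)
    return math.ceil(8.0 * need / eps_m)


def n_sigma_interval(eps: float) -> int:
    """Hypothetical upper bound on the covering number of [0, 1] by eps-balls."""
    return math.ceil(1.0 / (2.0 * eps)) + 1
```

Even for moderate `eps_m` the resulting lengths are large, which reflects that the bound is uniform over all measures rather than tuned to a particular one.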

For convenience we summarize the decision rule again here.

Decision Rule For any input sequence $x_1, x_2, \ldots$, form the subsequences

$$X^m = (x_{\beta(m-1)+1}, \ldots, x_{\beta(m)}).$$

Let $\mu_{X^m}$ denote the empirical measure of the sequence $X^m$. At the end of each parsing, decide $\mu \in A$ if
$\mu_{X^m} \in C_m$ and decide $\mu \in A^c$ otherwise. Between parsings, do not change the decision.
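
The rule can be sketched on a toy instance. We assume binary observations, so that the empirical measure of a block is determined by its fraction of ones, and we take $A$ to be the rational parameters, $B_m$ the fractions with denominator at most $m$, and $C_m$ the open neighbourhood of $B_m$ of radius $m^{-4}$. All of these choices, the function names, and the block lengths in the usage note are hypothetical and are not the lengths required by the theory.

```python
from fractions import Fraction
from itertools import islice
from typing import Callable, Iterable, Iterator, Tuple


def gap_to_B(p: Fraction, m: int) -> Fraction:
    """Distance from p in [0, 1] to the nearest fraction a/b with 1 <= b <= m."""
    best = Fraction(1)
    for b in range(1, m + 1):
        a = round(p * b)                      # nearest numerator for denominator b
        best = min(best, abs(p - Fraction(a, b)))
    return best


def in_C(m: int, p: Fraction) -> bool:
    """Toy open set C_m: points within m**-4 of B_m."""
    return gap_to_B(p, m) < Fraction(1, m ** 4)


def decisions(samples: Iterable[int],
              alpha: Callable[[int], int],
              member: Callable[[int, Fraction], bool]) -> Iterator[Tuple[int, str]]:
    """Yield (m, 'A' or 'Ac') at the end of each complete block X^m."""
    it = iter(samples)
    m = 1
    while True:
        block = list(islice(it, alpha(m)))
        if len(block) < alpha(m):
            return                            # not enough data for another block
        p = Fraction(sum(block), len(block))  # empirical fraction of ones
        yield m, ("A" if member(m, p) else "Ac")
        m += 1
```

Running `decisions([1, 0, 0] * 12, lambda m: 6 * m, in_C)` gives the decisions `A`, `Ac`, `A` for the first three blocks: the empirical value 1/3 is rejected at stage 2, where only denominators up to 2 are allowed, and accepted again at stage 3.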

We  now  claim: 

Theorem 2 The decision rule defined by the parsing $\beta(m)$ as above is successful.

Proof The proof is essentially identical to the proof of Theorem 1 in Kulkarni and Zeitouni (1991).

a) If $\mu \in A$, then by assumption A-1) 1) there exists $m_0(\mu)$ such that $\mu \in B_m$ for all $m > m_0(\mu)$. Note
that the event of making an error infinitely often is equivalent to the event of making an error at
the parsing intervals infinitely often. However, by our choice of $\alpha(m)$,

$$\sum_{m=1}^{\infty} \text{Prob}\{\text{error after } m\text{-th parsing}\} \le m_0(\mu) + \sum_{m=m_0(\mu)+1}^{\infty} \frac{1}{m^2} < \infty$$

Therefore, using the Borel-Cantelli lemma, we have that our decision rule satisfies part (S.1) of the
definition of a successful decision rule.

b) Let

$$N = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \left( C_m^{\sqrt{2\epsilon(m)}} \setminus A \right)$$  (2.16)

By assumption A-1) 3), $G(N) = 0$. Now, if $\mu \in A^c \setminus N$, we may repeat the arguments of part a) in the
following way: For an $m_0(\mu)$ large enough, $\mu \in (C_m^{\sqrt{2\epsilon(m)}})^c$ for all $m > m_0(\mu)$. Therefore, we have
$d(\mu, C_m) \ge \sqrt{2\epsilon(m)}$ for all $m > m_0(\mu)$. Then using Corollary 1 with $\delta = \sqrt{2\epsilon(m)}$, the expression for
$\bar{N}(\delta/4, M_1(\Sigma))$ from (2.8), and the choice of $\alpha(m)$, we have that for $m > m_0(\mu)$

$$\text{Prob}\{\text{error after } m\text{-th parsing}\} = P(\mu_{X^m} \in C_m) \le \frac{1}{m^2}$$  (2.17)

Hence, as in part a), the result follows by a simple application of the Borel-Cantelli lemma.

□
3  Classification Among a Countable Number of Sets


In this section, we refine the decision rule to allow for classification among a countable number of sets.
Specifically, if $A_1, A_2, \ldots$ are a countable number of subsets of $M_1(\Sigma)$, we are interested in deciding to
which of the $A_i$ the unknown measure $\mu$ belongs. The only assumption we make on the $A_i$ is that each $A_i$
satisfies the structural assumption (A-1). The $A_i$ are not required to be either disjoint or nested, although
these special cases are most commonly of interest in applications. In general, after a finite number of
observations one cannot expect to determine the membership status of $\mu$ in all of the $A_i$. However, we will
show that for all $\mu$ except in a set of $G$-measure zero in $M_1(\Sigma)$ there is a decision procedure that a.s. will
eventually determine the membership of $\mu$ in any finite subset of the $A_i$. In the special cases of disjoint or
nested $A_i$, the membership status of $\mu$ in any of the countable $A_i$ is completely determined by membership
in some finite subset. Hence, in these cases, except for $\mu$ in a set of $G$-measure zero, the membership of $\mu$
in all the $A_i$ will a.s. be eventually determined.

We modify our previous decision rule as follows. The observations $x_1, x_2, \ldots$ will still be parsed into
increasingly larger blocks in a manner to be defined below. However, now, at the end of the $m$-th block,
we will make a decision as to the membership of $\mu$ in the first $m$ of the $A_i$. The decisions of whether $\mu$
belongs to $A_1, \ldots, A_m$ are made separately for each $A_i$ using a procedure similar to that of the previous
section.


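Specifically, for each $A_i$ let $B_{i,m}$ be a sequence of closed sets, $C_{i,m}$ a sequence of open sets and
$\epsilon_i(m) > 0$ a positive sequence satisfying the requirements of the structural assumption (A-1). From
the same considerations that led to (2.15), for

$$\alpha_i(m) = \frac{8}{\epsilon_i(m)} \left[ 2 \log m + \log 2 + N^\Sigma\left(\sqrt{\epsilon_i(m)/8}\right) \left(1 - \log \sqrt{\epsilon_i(m)/8}\right) \right]$$  (3.18)

we have, for $\mu \in A_i$ and $m$ large enough that $\mu \in B_{i,m}$,

$$P(\mu_{\alpha_i(m)} \in C_{i,m}^c) \le \frac{1}{m^2}$$  (3.19)

As before, the observation sequence $x_1, x_2, \ldots$ will be parsed into non-overlapping blocks

$$X^m = (x_{\beta(m-1)+1}, \ldots, x_{\beta(m)})$$  (3.20)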

where the $\beta(m)$ are defined below. At the end of the $m$-th block, a decision will be made about the
membership of $\mu$ in $A_1, \ldots, A_m$. This decision will be made separately for each $i = 1, \ldots, m$ using the
observation sequence $X^m$ exactly as before. That is, at the end of the parsing sequence $X^m$, for $i = 1, \ldots, m$
decide that $\mu \in A_i$ according to whether or not $\mu_{X^m} \in C_{i,m}$, and do not change the decision except at the
end of a parsing sequence. We define the parsing sequence $\beta(m)$ by $\beta(0) = 0$ and $\beta(m) - \beta(m-1) = \max_{1 \le i \le m} \alpha_i(m)$, or equivalently

$$\beta(m) = \sum_{k=1}^{m} \max_{1 \le i \le k} \alpha_i(k), \qquad \beta(0) = 0$$  (3.21)
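
The modified rule can be sketched in the same toy setting of binary observations, where a block is summarized by its fraction of ones. The function name `multi_decisions`, the singleton sets, the radii, and the block lengths in the usage note are hypothetical choices made only so that the example runs; they are not the lengths required by the theory.

```python
from fractions import Fraction
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def multi_decisions(samples: Iterable[int],
                    alpha: Callable[[int, int], int],
                    in_C: Callable[[int, int, Fraction], bool]
                    ) -> Iterator[Tuple[int, List[bool]]]:
    """At the end of block m, yield the membership decisions for A_1, ..., A_m.

    alpha(i, m) is the block length wanted by set A_i at stage m, and
    in_C(i, m, p) tests whether the empirical value p lies in C_{i,m}.
    """
    it = iter(samples)
    m = 1
    while True:
        # Block length: the largest requirement among the first m sets.
        length = max(alpha(i, m) for i in range(1, m + 1))
        block = list(islice(it, length))
        if len(block) < length:
            return
        p = Fraction(sum(block), len(block))
        yield m, [in_C(i, m, p) for i in range(1, m + 1)]
        m += 1
```

As a usage example, take the singleton sets with parameter $1/(i+1)$, test `abs(p - Fraction(1, i + 1)) < Fraction(1, 4 * m)`, and lengths `3 * i * m`. On the periodic input `[1, 0, 0] * 14` the first block wrongly accepts the first set, and from the second block on only the second set is accepted.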

For  this  decision  rule  we  have  the  following  theorem. 

Theorem 3 Let $A_i \subset M_1(\Sigma)$ for $i = 1, 2, \ldots$ satisfy the structural assumption (A-1). There is a set
$N \subset M_1(\Sigma)$ of $G$-measure zero such that for every $\mu \in M_1(\Sigma) \setminus N$ and every $k < \infty$ the decision rule will
make (a.s.) only a finite number of mistakes in deciding the membership of $\mu$ in $A_1, \ldots, A_k$. That is, given
any $\mu \in M_1(\Sigma) \setminus N$, for a.e. $\omega$ there exists $m(\omega) = m(\omega, \mu, k)$ such that for all $m > m(\omega)$ the algorithm
makes a correct decision as to whether $\mu \in A_i$ or $\mu \in A_i^c$ for $i = 1, \ldots, k$.

Proof Let

$$N_i = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \left( C_{i,m}^{\sqrt{2\epsilon_i(m)}} \setminus A_i \right)$$  (3.22)

and let

$$N = \bigcup_{i=1}^{\infty} N_i$$  (3.23)

Then from the assumption (A-1) it follows that the $G$-measure of each $N_i$ is zero, and so the $G$-measure
of $N$ is also zero.

Now, let $\mu \in M_1(\Sigma) \setminus N$, and let $k < \infty$. For each $i = 1, \ldots, k$, there exists $m_i(\mu) < \infty$ such that if
$\mu \in A_i$ then $\mu \in B_{i,m}$ for all $m > m_i(\mu)$, while if $\mu \in A_i^c$ then $\mu \in (C_{i,m}^{\sqrt{2\epsilon_i(m)}})^c$ for all $m > m_i(\mu)$ (since
$\mu \notin N_i$). Recall that at the end of the parsing sequence $X^m$, the algorithm decides $\mu \in A_i$ iff $\mu_{X^m} \in C_{i,m}$,
so that if $\mu \in A_i$ then an error is made about membership in $A_i$ iff $\mu_{X^m} \notin C_{i,m}$, while if $\mu \notin A_i$ an error
is made iff $\mu_{X^m} \in C_{i,m}$. If $\mu \in A_i$ then using Corollary 1 and the fact that $d(B_{i,m}, C_{i,m}^c) \ge \sqrt{2\epsilon_i(m)}$, we
have that the probability of making an incorrect decision is less than $1/m^2$ for $m > m_i(\mu)$. On the other
hand, if $\mu \in A_i^c$ then since $d(C_{i,m}, (C_{i,m}^{\sqrt{2\epsilon_i(m)}})^c) \ge \sqrt{2\epsilon_i(m)}$ we also have probability of error less than
$1/m^2$ for $m > m_i(\mu)$ (again using Corollary 1 and the expression for $\alpha_i(m)$). Hence, for $m > m_0(\mu) = \max(m_1(\mu), \ldots, m_k(\mu))$
the probability of making an error about the membership of $\mu$ in any of $A_1, \ldots, A_k$
is less than $k/m^2$. Then

$$\sum_{m=1}^{\infty} \text{Prob}\{\text{error in any } A_i \text{ on } m\text{-th parsing}\} \le m_0 + k \sum_{m=m_0+1}^{\infty} \frac{1}{m^2} < \infty$$

so that the theorem follows by the Borel-Cantelli Lemma.

□


□ 


Note that if one also wants to make a correct decision after some finite time whether or not $\mu$ is in any
of the $A_i$ for $i = 1, 2, \ldots$ then the decision procedure can be easily modified to handle this. Specifically,
it is easy to show that sets satisfying the structural assumption are closed under countable union. Hence,
one could include in the hypothesis testing the set $A_0 = \bigcup_{i=1}^{\infty} A_i$, so that after some finite time a correct
decision would be made about the membership of $\mu$ in $A_0$.

Also, it is worthwhile to note that if the $A_i$ have more structure then some improvements can be made.
For example, if the membership status of $\mu$ in $A_i$ for $i = 1, 2, \ldots$ is determined by its membership status in
some finite number of the $A_i$ then a correct decision regarding the membership of $\mu$ in all of the $A_i$ can be
guaranteed (a.s.) after some finite time (depending on $\mu$). This is the case for disjoint or nested $A_i$, which
may be of particular interest in some applications. For these cases, by letting $A_0 = \bigcup_{i=1}^{\infty} A_i$ and running
the decision rule on $A_0, A_1, A_2, \ldots$ as mentioned above, we have the following corollary of Theorem 3.

Corollary 2 Let $A_i \subset M_1(\Sigma)$ for $i = 1, 2, \ldots$ satisfy the structural assumption (A-1) and suppose the $A_i$
are either disjoint or nested. There is a set $N \subset M_1(\Sigma)$ of $G$-measure zero such that for every $\mu \in M_1(\Sigma) \setminus N$
the decision rule will make (a.s.) only a finite number of mistakes in deciding the membership of $\mu$ in all
of the $A_i$. That is, given any $\mu \in M_1(\Sigma) \setminus N$, for a.e. $\omega$ there exists $m(\omega) = m(\omega, \mu)$ such that for all
$m > m(\omega)$ the algorithm makes a correct decision as to whether $\mu \in A_i$ for all $i = 1, 2, \ldots$.

It is worthwhile to note that the results of this section may be used also in the case that $\Sigma$ is locally
compact but not compact. In that case, one may first intersect the $A_i$ with compact sets $K_m$ which
sequentially approximate $\Sigma$ and then use $m(n) \to \infty$. We do not consider this issue here.

We conclude this section with an example taken from the problem of density estimation. Let $\Sigma = [0, 1]$
and assume that $x_1, \ldots, x_n$ are i.i.d. and drawn from a distribution with law $\mu_\theta$, $\theta \in \Theta$. When some
structure is given on the set $T = \bigcup_{\theta \in \Theta} \{\mu_\theta\}$, there exists a large body of literature which enables one to
obtain estimates of the error after $n$ observations (e.g., see Ibragimov and Has'minskii (1981)). All these
results assume an a-priori structure, e.g. a bound on the $L^2$ norm of the density $f_\theta = d\mu_\theta/dx$. If such
information is not given a-priori, it may be helpful to design a test to check for this information and thus
to be able to estimate eventually whether the distribution belongs to a nice set and if so to apply the error
estimates alluded to above. The application of such an idea to density estimation was suggested by Cover
(1972).

As  a  specific  example,  let 

Note that the sets $A_i$ are closed w.r.t. the Prohorov metric and therefore they satisfy the structural
assumption A-1). Moreover, they are nested and thus Corollary 2 may be applied to yield a decision rule
which will asymptotically decide correctly on the appropriate class of densities.

A somewhat different application to density estimation arises when the $A_i$ consist of single points (i.e.,
each $A_i$ contains a single probability measure). The special case in which $A_i$ consists of the $i$-th computable
density is related to a model considered by Barron (1985) and Barron and Cover (1991). For an estimation
procedure based on the Minimum Description Length (MDL) principle, they showed strong consistency
results when the true density is a computable one. Since there are a countable number of computable
densities and the structural assumption A-1) is satisfied for any singleton, a strong consistency result for
computable densities follows immediately from our results.

4  Applications  to  Order  Determination  of  Markov  Processes 

In  this  section,  we  extend  the  model  of  the  observations  to  allow  for  a  Markov  dependence  in  the  observation. 
The  problem  we  wish  to  consider  is  the  order  selection  problem:  given  observations  from  an  (unknown) 
Markov  chain,  one  wishes  to  estimate  the  order  of  the  chain  in  order  to  best  fit  a  Markov  model  to  the 
data. 

Specifically, let $\Sigma$ be a compact Polish space as before, but assume that the observations $x_1, \ldots, x_n$ are
the outcome of a Markov chain of order $j$, i.e.

$$\mathrm{Prob}(x_k \in A \mid x_{k-1}, x_{k-2}, \ldots, x_1) = \pi^j(x_k \in A \mid x_{k-1}, x_{k-2}, \ldots, x_{k-j})$$

where $A$ is a Borel measurable subset of $\Sigma$ and $k > j$. In order to avoid technicalities, we assume that all
Markov chains involved are ergodic, and therefore there exists a unique stationary measure $P^{\pi^j} \in M_1(\Sigma^j)$
such that for any measurable set $A$ in $\Sigma^j$,

$$P^{\pi^j}(A) = \int_{\{(x_{j+1}, \ldots, x_{2j}) \in A\}} d\pi^j(x_{2j} \mid x_{2j-1}, \ldots, x_j) \cdots d\pi^j(x_{j+1} \mid x_j, \ldots, x_1)\, dP^{\pi^j}(x_j, \ldots, x_1) \tag{4.24}$$
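For a finite state space and order $j = 1$, equation (4.24) reduces to the fixed-point relation $P = P\pi$ for a row-stochastic matrix $\pi$. The following is a minimal sketch that approximates this fixed point by power iteration. The function name `stationary_distribution`, the iteration count, and the two-state kernel are illustrative assumptions and do not appear in the paper.

```python
def stationary_distribution(kernel, iterations=500):
    """Approximate the stationary row vector P = P * kernel by power iteration.

    kernel[a][b] is the probability of moving from state a to state b.
    Convergence is assumed (for instance, an ergodic aperiodic chain).
    """
    n = len(kernel)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of the chain: new_p[b] = sum_a p[a] * kernel[a][b]
        p = [sum(p[a] * kernel[a][b] for a in range(n)) for b in range(n)]
    return p


# Hypothetical two-state kernel; solving P = P * kernel by hand gives (2/3, 1/3).
example_kernel = [[0.9, 0.1],
                  [0.2, 0.8]]
stationary = stationary_distribution(example_kernel)
```

Power iteration is chosen here only for brevity; solving the linear system directly would serve equally well.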

We  assume  that  j  is  unknown,  and  our  task  is  to  decide  (correctly)  on  the  order  j. 

This  problem  has  already  been  considered  in  the  literature.  Hannan  and  Quinn  (1979)  and  later  Hannan 
(1980)  considered  the  case  of  autoregressive  and  ARMA  models,  and  proved,  under  some  assumptions,  the 
consistency  of  an  estimator  based  on  the  Akaike  criterion.  For  a  related  work,  see  Shibata  (1980).  In  all 
the  above,  an  effort  is  made  also  to  prove  asymptotic  optimality  of  the  proposed  estimators.  In  the  discrete 
alphabet (finite $\Sigma$) set-up, Merhav et al. (1989) proposed an estimator based on relative entropy, related
it  to  the  Lempel-Ziv  compression  algorithm,  and  proved  its  asymptotic  optimality  in  the  sense  of  large 
deviations.  However,  their  approach  does  not  guarantee  in  general  a  zero  probability  of  error  and  may 
result  in  biased  estimates. 

In  this  section,  we  depart  from  the  above  by,  on  the  one  hand,  relaxing  the  requirement  for  “asymptotic 
optimality”  and,  on  the  other  hand,  considering  the  general  setup  of  Markov  chains.  We  show  how  a 
strongly  consistent  decision  rule  may  be  constructed  based  on  the  general  paradigm  of  this  paper.  Towards 
this end, we need to extend the basic estimates of Section 2 to the Markov case, as follows.

Let $\Omega = \Sigma^{\mathbb{Z}}$, define $x_i$ to be the coordinate map $x_i(\omega) = \omega_i$, and let the shift operator $T$ be defined by
$x_i(T\omega) = x_{i+1}(\omega)$. Define the $k$-th order empirical measure on $M_1(\Sigma^k)$ by

$$\mu_n^k = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_1(T^i\omega),\, x_2(T^i\omega),\, \ldots,\, x_k(T^i\omega)}$$

As before, we endow $M_1(\Sigma^k)$ with the Prohorov topology, and recall that a large deviations upper bound
holds for the empirical measure $\mu_n^{j+1}$, viz. for any set $A \subset M_1(\Sigma^{j+1})$ a large deviations statement of the
form (2.1) holds, with the relative entropy $H(\nu|\mu)$ being replaced by

$$\begin{cases} H(\nu \mid \nu_j \otimes \pi^j) & \text{if } \nu \text{ is shift invariant} \\ \infty & \text{otherwise} \end{cases} \tag{4.25}$$

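For any measure $\mu(x_1, \ldots, x_k) \in M_1(\Sigma^k)$, denote by $\mu_i$ the marginal defined by

$$\mu_i(\{x_1, \ldots, x_i\} \in A) = \mu(\{x_1, \ldots, x_i\} \in A,\ \{x_{i+1}, \ldots, x_k\} \in \Sigma^{k-i})$$

and by $\mu_{i|i-1,\ldots,i-k}$ the regular conditional probability $\mu(x_i \mid x_{i-1}, \ldots, x_{i-k})$. With a slight abuse of notation,
we continue to use $\mu_i$ for the marginal of a measure $\mu \in M_1(\Sigma^{\mathbb{Z}})$. Define the measure
$\mu_{i-k} \otimes \mu_{i-(k-1)|i-k,\ldots,1} \otimes \cdots \otimes \mu_{i|i-1,\ldots,1} \in M_1(\Sigma^i)$ as the measure which assigns to any measurable set $A \subset \Sigma^i$ the value

$$\int_A d\mu_{i-k}(x_1, \ldots, x_{i-k})\, d\mu_{i-(k-1)|i-k,\ldots,1}(x_{i-(k-1)} \mid x_{i-k}, \ldots, x_1) \cdots d\mu_{i|i-1,\ldots,1}(x_i \mid x_{i-1}, \ldots, x_1) \tag{4.26}$$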

Let $\pi^j$ be a given $j$-th order Markov kernel, $P^{\pi^j}$ its corresponding stationary measure, and denote by
$\mathrm{Pr}^{\pi^j}$ the stationary measure on $\Omega$ generated by this kernel. Assume that the empirical measures $\mu_n^{j+k}$,
$k = 2, 3, \ldots$ are formed from a Markov sequence generated by this kernel. In order to compute explicitly
the sequence of decision rules as in the i.i.d. case, we need to derive the analog of Theorem 1 given below.

Theorem  4 


Pr ** 


p3n+k  £  B  (( p3n+2)j  0  7 rJ  ®  •  •  ■  ®  tp\  <5))] 
N  Af1(EJ+fc)^  e-«(3l)2 


+••■+  s(j J«'-M‘(s''+I))e“”(5Ar)2 

<  iJv(^,X,(E'+‘))e-“(3w)s  (4.27) 


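Proof We prove the Theorem first for the case $k = 2$. The general case follows by induction. For any
$\nu \in M_1(\Sigma^{j+2})$, let

$$A(\nu, \delta) = \{\mu \in M_1(\Sigma^{j+2}) : d(\mu, \nu_j \otimes \pi^j \otimes \pi^j) < \delta\}$$

$$C(\nu, \delta) = \{\mu \in M_1(\Sigma^{j+2}) : d(\mu, \nu_{j+1} \otimes \pi^j) < \delta/4\}.$$

It follows that

$$\mathrm{Pr}^{\pi^j}\big(\mu_n^{j+2} \notin A(\mu_n^{j+2}, \delta)\big) \le \mathrm{Pr}^{\pi^j}\big(\mu_n^{j+2} \notin A(\mu_n^{j+2}, \delta),\ \mu_n^{j+2} \in C(\mu_n^{j+2}, \delta)\big) + \mathrm{Pr}^{\pi^j}\big(\mu_n^{j+2} \notin C(\mu_n^{j+2}, \delta)\big) = P_1 + P_2 \tag{4.28}$$

By repeating the argument in (2.12),

$$P_1 \le N(\delta/4, M_1(\Sigma^{j+2})) \sup_{y \in M_1(\Sigma^{j+2})} \mathrm{Pr}^{\pi^j}\big(\mu_n^{j+2} \in B(y, \delta/4),\ \mu_n^{j+2} \in C(\mu_n^{j+2}, \delta),\ y \notin A(\mu_n^{j+2}, 3\delta/4)\big) \tag{4.29}$$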


Therefore, by the Chebycheff bound, denoting by $P_n$ the law of the random variable $\mu_n^{j+2}$, it follows that
for any $\theta \in C_b(\Sigma^{j+2})$,

Pi 


N  (f,A4i(&'+2))  "  ;/eA?iU(S>+2)  X(y,|) 


en<e,v>e  n<9,»>  l^2eC(^2  s)  yiA(^2  z6/4))dPn(v) 


<  sup  exp 

3/£A<l(&+2) 


— n 


sup  inf  (<9,u>-~  log  EPn  (en<6’u> )) 

VEJ+2)  u€B(yA)C\C(v,S)  \  n  ) 


0€C6(£J+2)  ^€B(y.|)f)c( 

y£A(i/,36/4) 


exp 


-n  inf 

•'eB(y,$)f)C(v,S)  0eC{>(&'+2) 
y£A(v,3S/4) 


sup  (<  9,1/  >  -ilogPpn(e"<e’I/>)N) 
:j&'+2)  \  n  ) 


(4.30) 


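However,

$$\begin{aligned}
\sup_{\theta \in C_b(\Sigma^{j+2})} & \Big(\langle\theta, \nu\rangle - \frac{1}{n} \log E_{P_n}\big(e^{n\langle\theta, \nu\rangle}\big)\Big) \\
&= \sup_{\theta \in C_b(\Sigma^{j+2})} \Big(\langle\theta, \nu\rangle - \frac{1}{n} \log E_{P_n}\big(e^{\theta(x_1, \ldots, x_{j+2}) + \cdots + \theta(x_{n-j-1}, \ldots, x_n)}\big)\Big) \\
&= \sup_{\theta \in B(\Sigma^{j+2})} \Big(\langle\theta, \nu\rangle - \frac{1}{n} \log E_{P_n}\big(e^{\theta(x_1, \ldots, x_{j+2}) + \cdots + \theta(x_{n-j-1}, \ldots, x_n)}\big)\Big)
\end{aligned} \tag{4.31}$$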

where $B(\Sigma^{j+2})$ denotes the space of bounded measurable functions on $\Sigma^{j+2}$ and the last equality follows
from dominated convergence. We assume now that $\nu$ is absolutely continuous w.r.t. $\nu_{j+1} \otimes \pi^j$ and that the
resulting Radon-Nikodym derivative is uniformly bounded from above and below (these assumptions may
be relaxed exactly as in Deuschel and Stroock (1989), pg. 69). In this case, we may take $\theta$ in (4.32) as the
logarithm of this Radon-Nikodym derivative, i.e. $\theta(x_1, \ldots, x_{j+2}) = \log \frac{d\nu}{d(\nu_{j+1} \otimes \pi^j)}$, to obtain:

$$\sup_{\theta \in B(\Sigma^{j+2})} \Big(\langle\theta, \nu\rangle - \frac{1}{n} \log E\big(e^{\theta(x_1, \ldots, x_{j+2}) + \cdots + \theta(x_{n-j-1}, \ldots, x_n)}\big)\Big) \ge H(\nu \mid \nu_{j+1} \otimes \pi^j) \tag{4.32}$$

Substituting (4.32) in (4.30) and recalling the inequality


one  obtains 


tf(I 


Similarly, 


<  exp 


<  exp 


<  exp 


(  \ 

n  inf  0  it3) 

.  ‘'es(s,,£)p)c(M) 

\  y£A(v,36/4)  ) 

(  \ 

—n  inf  d(u,  Pj+i  ®  7rJ)2/4 

\  y£A(v,36/4)  ) 

( 

inf  (d(v,  Vj  7T7)  -  d(z/J+ 1  ®  7rJ,  vj  0  0  t-*)) 


\  y£A(i/,3S/4) 

<  exp  (— n  (8/ 16)2) 


P2 


<  exp  {—n  (8/ 64)2j 


(4.33) 

(4.34) 


N({-6,M1(^+ 1)) 

Substituting (4.33) and (4.34) in (4.28) yields the Theorem for $k = 2$. The general case is similar and
follows by induction.


□ 

We  are  now  ready  to  return  to  the  order  determination  problem  described  in  the  beginning  of  this 
section.  Since  the  set  up  here  differs  slightly  from  the  one  described  in  the  previous  section,  we  repeat  here 
the  main  definitions. 

Let $A_i \subset M_1(\Omega)$, $i = 0, 1, \ldots$, be the set of stationary measures generated by Markov chains of order $i$
(with $i = 0$ denoting the i.i.d. case), i.e. for $i = 0, 1, 2, \ldots$,

$$\mu \in A_i \iff (\mu)_{i+k} = (\mu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i \ \text{ for some Markov kernel } \pi^i \text{ and for all } k = 1, 2, \ldots$$

Note that the sets $A_i$ are closed, so that we may use the $A_i$ themselves in assumption A-1).

Natural candidates for the covering sets $C_{i,m}$ are $\sqrt{2\epsilon(m)}$ dilations of the $A_i$. That is, let $\delta_m = \sqrt{2\epsilon(m)}$
and define

$$C_{i,m} = \{\nu \in M_1(\Sigma^{i+m}) : d(\nu, (\nu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i) < \delta_m \text{ for some Markov kernel } \pi^i\}$$

and let

$$\tilde{C}_{i,m} = \{\nu \in M_1(\Omega) : (\nu)_{i+m} \in C_{i,m}\}.$$

It is clear that $C_{i,m}$ is open, and also that

$$\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \tilde{C}_{i,m} = A_i.$$

Therefore, by using Theorem 4 and the procedure described in Theorem 3, the sets $C_{i,m}$ are candidates for
building a decision rule which, a.s., decides correctly in finite time whether the given observation sequence
was generated by a Markov chain of order $i$. In order to be able to do so, we need only to check that
the complements of the sets $C_{i,m}$, which are closed, have the property that they may be covered by small
enough spheres (say, $\delta_m/4$ spheres), such that the union of those spheres belongs to the complement of
some $C_{i,m'}$. This can be seen by using the following lemma.

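Lemma 2 Let $\nu, \nu' \in M_1(\Sigma^k)$. Assume that for some $i < k - 2$,

$$d(\nu, (\nu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i) > \delta_m.$$

Further assume that $d(\nu, \nu') < \delta_m/4$. Then

$$d(\nu', (\nu')_i \otimes \pi^i \otimes \cdots \otimes \pi^i) > \delta_m/2.$$

Proof Note that $d(\nu, \nu') < \delta_m/4$ implies that $d((\nu)_i, (\nu')_i) < \delta_m/4$ and that [...] $< \delta_m/4$. On
the other hand, since $\pi^i$ is a Markov kernel, it also follows that $d((\nu)_i \otimes \pi^i, (\nu')_i \otimes \pi^i) < \delta_m/4$ and therefore
also that $d((\nu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i, (\nu')_i \otimes \pi^i \otimes \cdots \otimes \pi^i) < \delta_m/4$. Hence,

$$\begin{aligned}
d(\nu', (\nu')_i \otimes \pi^i \otimes \cdots \otimes \pi^i) &\ge d(\nu, (\nu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i) - d(\nu, \nu') \\
&\quad - d((\nu)_i \otimes \pi^i \otimes \cdots \otimes \pi^i, (\nu')_i \otimes \pi^i \otimes \cdots \otimes \pi^i) \\
&> \delta_m/2
\end{aligned}$$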


We have now completed all the preparatory steps required for the definition of the proposed decision
rule. Indeed, let $\epsilon_i(m)$ be a sequence of positive numbers, define

[3logm  +  iVs'+m  ^y4;(m)/24TO+3^  ^1  —  log  yj €i(m) /24m+3 


a,(m)  = 


€i(m)  L 


(4.35) 


We have, by Lemma 2 and Theorem 4, that for any Markov measure $\mu \in A_i$,


WX)  €  Cf,m)  <  ^ 


m ‘ 


(4.36)

where we have used (4.27). The construction of the decision rule is then identical to the one described
in  Theorem  3,  i.e.  one  forms  the  parsing  of  the  observation  sequence  into  the  nonoverlapping  blocks  Xm 
described  in  equation  (3.20)  with  0(m)  chosen  as  in  (3.21).  At  each  step,  one  forms,  based  on  the  block 
$X_m$, the empirical measures of order $m, m+1, \ldots, 2m$. The order estimate at the $m$-th step is now the
smallest $i$ such that the empirical measure of order $i + m$ belongs to $C_{i,m}$. By the results of section 3, this
decision rule achieves a.s. only a finite number of errors, regardless of the true order.
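The flavor of this rule can be sketched for a finite alphabet as follows. This is a simplified illustration under several assumptions that are not in the paper: total variation distance stands in for the Prohorov metric, the Markov kernel is a plug-in estimate read off the empirical measure itself, a single block and a fixed threshold `delta` replace the block parsing and the sequence $\delta_m$, and windows have fixed length $i + 2$. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product
import random


def window_freqs(xs, length):
    """Empirical measure of overlapping windows of the given length."""
    n = len(xs) - length + 1
    counts = Counter(tuple(xs[t:t + length]) for t in range(n))
    return {w: c / n for w, c in counts.items()}


def distance_to_markov_extension(xs, i, alphabet, extra=2):
    """Total variation distance between the empirical measure nu of order
    i + extra and nu_i (x) pi (x) ... (x) pi, with pi a plug-in order-i kernel."""
    length = i + extra
    nu = window_freqs(xs, length)
    ctx, step = defaultdict(float), defaultdict(float)
    for w, p in nu.items():
        ctx[w[:i]] += p        # marginal on the first i coordinates
        step[w[:i + 1]] += p   # marginal on the first i + 1 coordinates
    total = 0.0
    for w in product(alphabet, repeat=length):
        ext = ctx.get(w[:i], 0.0)
        for t in range(i, length):
            c = ctx.get(w[t - i:t], 0.0)
            # Plug-in kernel: pi(w[t] | previous i symbols)
            ext *= step.get(w[t - i:t + 1], 0.0) / c if c > 0 else 0.0
        total += abs(nu.get(w, 0.0) - ext)
    return total / 2


def estimate_order(xs, alphabet, delta, max_order=3):
    """Smallest i whose Markov extension is within delta, or None."""
    for i in range(max_order + 1):
        if distance_to_markov_extension(xs, i, alphabet) < delta:
            return i
    return None


# Synthetic data: a first-order binary chain that keeps its state with
# probability 0.9, and an i.i.d. fair-coin sequence.
rng = random.Random(0)
chain = [0]
for _ in range(20000):
    chain.append(chain[-1] if rng.random() < 0.9 else 1 - chain[-1])
iid = [rng.randrange(2) for _ in range(20000)]

order_chain = estimate_order(chain, (0, 1), delta=0.05)
order_iid = estimate_order(iid, (0, 1), delta=0.05)
```

The sketch enumerates all tuples over the alphabet, so it is only practical for small alphabets and short windows.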

Appendix 

For completeness, here we prove a lower bound for the covering number of $M_1(\Sigma)$ with respect to
the Prohorov metric. This lower bound exhibits a behavior similar to the upper bound of (2.8), so that
these bounds cannot be much improved. In the proof below, $M(\epsilon, Y, \eta)$ denotes the $\epsilon$-capacity (or packing
number) of the space $Y$ with respect to the metric $\eta$. That is, $M(\epsilon, Y, \eta)$ represents the maximum number
of non-overlapping balls of diameter $\epsilon$ with respect to the metric $\eta$ that can be packed in $Y$. The well
known relationship

$$N(2\epsilon, Y, \eta) \le M(2\epsilon, Y, \eta) \le N(\epsilon, Y, \eta)$$

between covering numbers and packing numbers is easy to show and is used in the proof below. Note
that for a Polish space $\Sigma$ with metric $\eta$, we use the notations $N(\epsilon, \Sigma, \eta) = N_\Sigma(\epsilon)$ and $N(\epsilon, M_1(\Sigma), d) =
N(\epsilon, M_1(\Sigma))$.
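One direction of this relationship can be checked on a finite point set: a maximal family of points that are pairwise at least $r$ apart is automatically an $r$-cover, since any uncovered point could have been added to the family. The sketch below builds such a family greedily. The function name and the grid of test points are illustrative assumptions, and the radius/diameter conventions are simplified relative to the definitions above.

```python
def greedy_separated_subset(points, r, dist):
    """Greedily pick points that are pairwise at least r apart.

    The result is maximal among the given points: every point that was not
    selected lies within distance < r of some selected point.
    """
    centers = []
    for p in points:
        if all(dist(p, c) >= r for c in centers):
            centers.append(p)
    return centers


def abs_dist(a, b):
    return abs(a - b)


# Grid on [0, 1] with spacing 0.01, separation radius 0.25.
points = [k / 100 for k in range(101)]
centers = greedy_separated_subset(points, 0.25, abs_dist)

# Every grid point is within 0.25 of a selected center.
covered = all(min(abs_dist(p, c) for c in centers) < 0.25 for p in points)
```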

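Lemma: Let $\Sigma$ be a compact Polish space with metric $\eta$, and let $M_1(\Sigma)$ denote the set of probability
measures on $\Sigma$ with the Prohorov metric $d$. Then

$$N(\epsilon, M_1(\Sigma)) \ge 8\epsilon \sqrt{N_\Sigma(2\epsilon)} \left(\frac{1}{8\epsilon}\right)^{N_\Sigma(2\epsilon)}$$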

Proof: First, we can find $N = N_\Sigma(\epsilon)$ points $x_1, \ldots, x_N$ which are pairwise greater than or equal to $\epsilon$
apart. Each measure supported on these $N$ points corresponds to a point in $\mathbb{R}^N$ in the natural way. Then,
the set of all probability measures supported on $x_1, \ldots, x_N$ corresponds to the simplex $S_N$ in $\mathbb{R}^N$.

Now, let $p, q$ be points on the simplex $S_N$ and suppose that $d_{\ell_1}(p, q) \ge 2\epsilon$ where $d_{\ell_1}(p, q) = \sum_{i=1}^{N} |p_i - q_i|$.
Then on some subset $G \subset \{1, \ldots, N\}$ of coordinates either $\sum_{i \in G} p_i \ge \sum_{i \in G} q_i + \epsilon$ or $\sum_{i \in G} q_i \ge \sum_{i \in G} p_i + \epsilon$.
Then, considered as probability measures on $\Sigma$, $d(p, q) \ge \epsilon$ since there is a closed set $F \subset \Sigma$, namely
$F = \{x_i \mid i \in G\}$, for which either $p(F) \ge q(F^\epsilon) + \epsilon$ or $q(F) \ge p(F^\epsilon) + \epsilon$. Hence,

$$N(\epsilon/2, M_1(\Sigma), d) \ge M(\epsilon, M_1(\Sigma), d) \ge M(2\epsilon, S_N, d_{\ell_1}) \ge N(2\epsilon, S_N, d_{\ell_1})$$
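The existence of the subset $G$ can be made concrete: taking $G$ to be the coordinates where $p_i > q_i$, the excess mass of $p$ over $q$ on $G$ equals half the $\ell_1$ distance, because both vectors sum to one. The following small sketch illustrates this; the function names and the example vectors are illustrative assumptions.

```python
def l1_distance(p, q):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def separating_subset(p, q):
    """Return (G, gap): the coordinates where p exceeds q, and the total
    excess mass of p over q on G. For probability vectors, gap equals
    half the l1 distance between p and q."""
    G = [i for i, (a, b) in enumerate(zip(p, q)) if a > b]
    gap = sum(p[i] - q[i] for i in G)
    return G, gap


# Two hypothetical probability vectors on three points.
p = (0.5, 0.3, 0.2)
q = (0.2, 0.3, 0.5)
G, gap = separating_subset(p, q)
distance = l1_distance(p, q)
```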

Finally, to get a lower bound on $N(2\epsilon, S_N, d_{\ell_1})$, we note that the $N - 1$ dimensional surface measure
of the simplex $S_N$ is $\sqrt{N}/(N - 1)!$ (simply differentiate the $N$-dimensional volume of the interior of an
$x$-scaled simplex with respect to $x$, taking the angles into account). On the other hand, note that the
$N - 1$ dimensional volume of the intersection of $S_N$ with an $N$ dimensional $\ell_1$ ball of radius $2\epsilon$ is not
larger than the volume of an $N - 1$ dimensional $\ell_1$ ball of radius $2\epsilon$, which equals $(4\epsilon)^{N-1}/(N - 1)!$.
Thus, $N(2\epsilon, S_N, d_{\ell_1}) \ge \sqrt{N}\,(1/(4\epsilon))^{N-1}$. Hence, $N(\epsilon/2, M_1(\Sigma)) \ge (1/(4\epsilon))^{N_\Sigma(\epsilon)}\, 4\epsilon \sqrt{N_\Sigma(\epsilon)}$, or equivalently

$$N(\epsilon, M_1(\Sigma)) \ge 8\epsilon \sqrt{N_\Sigma(2\epsilon)}\, (1/(8\epsilon))^{N_\Sigma(2\epsilon)}.$$
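The arithmetic of this last step can be checked numerically. The sketch below works with logarithms to avoid overflow and confirms that the ratio of the simplex surface measure to the $\ell_1$ ball volume agrees with the stated bound after the change of variable from $\epsilon/2$ to $\epsilon$. The function names and the sample values are illustrative assumptions.

```python
import math


def log_covering_lower_bound(eps, n_sigma):
    """Logarithm of 8*eps*sqrt(N)*(1/(8*eps))**N, where N = n_sigma plays
    the role of the covering number N_Sigma(2*eps) of the base space."""
    return math.log(8 * eps) + 0.5 * math.log(n_sigma) - n_sigma * math.log(8 * eps)


def log_simplex_ratio(eps, n):
    """Logarithm of [sqrt(n)/(n-1)!] / [(4*eps)**(n-1)/(n-1)!]."""
    log_factorial = math.lgamma(n)                      # log((n-1)!)
    surface = 0.5 * math.log(n) - log_factorial         # simplex surface measure
    ball = (n - 1) * math.log(4 * eps) - log_factorial  # l1 ball volume
    return surface - ball


# With N points at scale eps, the ratio equals the bound evaluated at eps/2.
ratio = log_simplex_ratio(0.01, 10)
bound = log_covering_lower_bound(0.005, 10)
```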

□